Systematic Clustering of Transcription Start Site Landscapes
Authors
Abstract
Genome-wide, high-throughput methods for transcription start site (TSS) detection have shown that most promoters have an array of neighboring TSSs, some of which are used more than others, forming a distribution of initiation propensities. TSS distributions (TSSDs) vary widely between promoters, and earlier studies have shown that TSSDs have biological implications for both regulation and function. However, no systematic study has explored how many types of TSSDs, and by extension core promoters, exist, or which biological features distinguish them. In this study, we developed a new non-parametric dissimilarity measure and clustering approach to explore the similarities and stabilities of clusters of TSSDs. Previous studies have used arbitrary thresholds to arrive at two general classes: broad and sharp. We demonstrate that, in addition to the previous broad/sharp dichotomy, a third category of promoters exists. Unlike typical TATA-driven sharp TSSDs, where the TSS position can vary by a few nucleotides, in this category virtually all TSSs originate from the same genomic position. These promoters lack the epigenetic signatures of typical mRNA promoters, and a substantial subset of them map upstream of ribosomal protein pseudogenes. We present evidence that these are likely mapping errors, which have confounded earlier analyses, caused by the high similarity of ribosomal gene promoters in combination with the known G-addition bias in CAGE libraries. Thus, previous two-class separations of promoters based on TSS distributions are well motivated, but the ultra-sharp TSS distributions will confound downstream analyses if not removed.
Similar resources
Mycobacterium avium subsp. paratuberculosis induces differential cytosine methylation at miR-21 transcription start site region
Mycobacterium avium subspecies paratuberculosis (MAP), as an obligate intracellular bacterium, causes paratuberculosis (Johne’s disease) in ruminants. In addition, MAP has consistently been isolated from Crohn’s disease (CD) lesions in humans; a notion implying possible direct causative ...
Design a Hybrid Recommender System Solving Cold-start Problem Using Clustering and Chaotic PSO Algorithm
One of the main challenges of the growth of information in the new era is finding information of interest in the mass of data. This problem has been considered in the design of many sites that interact with users. Recommender systems have been proposed to resolve this issue and help users find the information they want; however, they face limitations. One of the mos...
Automatic Procedures for Compilation of Promoter sequences and their Evaluation based on Signal Content and Positional Distributions
Collections of precisely mapped eukaryotic transcription start sites (TSS) are important resources for studying gene control elements and for developing promoter prediction algorithms. The present study consists of two parts. First, we present a new clustering method to infer TSS positions from EST 5' ends. Second, we present a quality evaluation of the promoter sequences defined by TSS positio...
TSSi - an R package for transcription start site identification from 5′ mRNA tag data
High-throughput sequencing has become an essential experimental approach for the investigation of transcriptional mechanisms. For some applications like ChIP-seq, several approaches for the prediction of peak locations exist. However, these methods are not designed for the identification of transcription start sites (TSSs) because such datasets contain qualitatively different noise. ...
Whole-genome mapping of 5' RNA ends in bacteria by tagged sequencing: a comprehensive view in Enterococcus faecalis.
Enterococcus faecalis is the third cause of nosocomial infections. To obtain the first snapshot of transcriptional organization in this bacterium, we used a modified RNA-seq approach that discriminates primary from processed 5' RNA ends. We also validated our approach by confirming known features in Escherichia coli. We mapped 559 transcription start sites (TSSs) and 352 processing sites...